Exercise 4; Clustering and classification

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Nov 29 04:27:09 2021"
library(MASS)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 2: Explore the data

data("Boston")

dim(Boston)
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Boston data frame is a sample R dataset from R library MASS. The Boston data frame has 506 rows and 14 columns. This data frame contains the following columns:

  • crim = per capita crime rate by town.
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town.
  • chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox = nitrogen oxides concentration (parts per 10 million).
  • rm = average number of rooms per dwelling.
  • age = proportion of owner-occupied units built prior to 1940.
  • dis = weighted mean of distances to five Boston employment centres.
  • rad = index of accessibility to radial highways.
  • tax = full-value property-tax rate per $10,000.
  • ptratio = pupil-teacher ratio by town.
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat = lower status of the population (percent).
  • medv = median value of owner-occupied homes in $1000s.

Task 3: Graphical overview of the dataset

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
par(mfrow=c(7,2), fin=c(5,5), mar=c(2,5,2,1), las=1, bty="n")
plot(Boston$crim, ylab = "Per capita", main = "Crime Rate")

plot(Boston$zn, ylab = "lots over 25,000 sq.ft", main = "Land zoned for lots")

plot(Boston$indus, ylab = "acres per town", main = "Non-retail business acres")

plot(Boston$medv, ylab = "Value of homes", main = "Owner-occupied homes")

plot(Boston$nox, ylab = "parts per 10 million", main = "NOX concentration")

plot(Boston$rm, ylab = "number of rooms", main = "Rooms per dwelling")

plot(Boston$age, ylab = "built prior to 1940", main = "Owner-occupied units built")

plot(Boston$dis, ylab = "weighted mean", main = "Distances to employment centres")

plot(Boston$rad, ylab = "index", main = "Accessibility to highways")

plot(Boston$tax, ylab = "tax rate per 10,000", main = "Property-tax rate")

plot(Boston$ptratio, ylab = "ratio", main = "Pupil-teacher")

plot(Boston$black, ylab = "1000(Bk - 0.63)^2", main = "Proportion of blacks")

plot(Boston$lstat, ylab = "Percent", main = "Lower status")

Compared to crime rate. The higher crime rate seems to be in rented apartments, with higher property tax rate. The owners doesn’t live in the same area. Also higher NOX rate (car pollution) and closer to highway seems to be increasing the crime rate. Employment centers are far away. On the other hand the crime rate is not relevant to skin color or apartment room count.

Task 4: Standardize the dataset

# Make standardized dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)

# Create categorical variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# Drop the crime rate from old data and add new categorial
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset to train and test sets.
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

dim(test)
## [1] 102  14
dim(train)
## [1] 404  14

Comparing the two summaries tells us that scaling doesn’t work on the most of the variables. When scaling, there are variables which can’t be negative. After scaling these variables are negative, because no clustering has being done.

Task 5: Linear discriminant analysis on train set

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crim)

plot(lda.fit, dimen = 2, col = classes, pch = classes, xlab = "Crime Rate", ylab = "Other factors")
lda.arrows(lda.fit, myscale = 2)

From this data we can see that accessibility to radial highways affects the crime rates the most. Nitrogen oxide concentration also has a noted link in the data for incrieasing the crime rate. Instead when proportion of residential land zones increases, the crime rates are lower.

Task 6: Predict the test dataset crime variable with LDA model

crimeCat <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = crimeCat, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19      13        1    0
##   med_low    3      14        7    0
##   med_high   1       1       10    1
##   high       0       0        0   32

High crime rate predictions are quite consent and it seems to be more variance on the low and med_low predictions.

Task 7: Standardize the original Boston dataset

data("Boston")
Boston <- scale(Boston)

km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)

I came up with four clusters. The reason for this is that from the charts the the clusters are not that dispersed and there are big centers from all the charts, which can be combined. With five clusters, there were one too small cluster or it was too dispersed. With three, there was even clusters, but with four there are also four quite even clusters. More is better.

Super Bonus: 3d Matrix

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$centers)

Both give the same value, but with crime, the values are colored.